Mask pattern for protein language models

ABSTRACT

The technology disclosed relates to accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences, applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, and cropping a portion of the multiple sequence alignment that includes the set of periodically-spaced masks at the first set of positions, and a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied. The first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence.

PRIORITY APPLICATIONS

This application claims the benefit of and priority to the following:

U.S. Provisional Patent Application No. 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV);

U.S. Provisional Patent Application No. 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV);

U.S. Provisional Patent Application No. 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NON-HUMAN PRIMATES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV);

U.S. Provisional Patent Application No. 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV);

U.S. Provisional Patent Application No. 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); and

U.S. Provisional Patent Application No. 63/294,830, titled “SPECIES-DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV).

The priority applications are incorporated by reference as if fully set forth herein.

FIELD OF THE TECHNOLOGY DISCLOSED

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using neural networks to analyze ordered data.

INCORPORATIONS

The following are incorporated by reference for all purposes as if fully set forth herein:

Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);

Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);

US patent application titled “PATHOGENICITY LANGUAGE MODEL,” filed contemporaneously (Attorney Docket No. ILLM 1063-3/IP-2296-US2);

U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);

U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);

U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);

U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);

U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);

U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);

U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);

U.S. patent application Ser. No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1680-US);

U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);

U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1037-2/IP-2051-US);

U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);

U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021 (Atty. Docket No. ILLM 1048-1/IP-2143-PRV);

U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021 (Atty. Docket No. ILLM 1037-3/IP-2051A-US);

U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV);

U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV); and

U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).

BACKGROUND

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

The explosion of available biological sequence data has led to multiple computational approaches that infer proteins' three-dimensional structure, biological function, fitness, and evolutionary history from sequence data. So-called protein language models, like the ones based on the Transformer architecture, have been trained on large ensembles of protein sequences by using the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones.

Protein language models capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. For example, protein language models can predict structural contacts from single sequences in an unsupervised way.

Protein sequences can be classified into families of homologous proteins that descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins provides important information about functional and structural constraints. The statistics of MSA columns, representing amino-acid sites, identify functional residues that are conserved during evolution. Correlations of amino acid usage between the MSA columns contain important information about functional sectors and structural contacts.

Language models were initially developed for natural language processing and operate on a simple but powerful principle: they acquire linguistic understanding by learning to fill in missing words in a sentence, akin to a sentence completion task in standardized tests. Language models develop powerful reasoning capabilities by applying this principle across large text corpora. The Bidirectional Encoder Representations from Transformers (BERT) model instantiated this principle using Transformers, a class of neural networks in which attention is the primary component of the learning system. In a Transformer, each token in the input sentence can “attend” to all other tokens by exchanging activation patterns corresponding to the intermediate outputs of neurons in a neural network.

Protein language models like the MSA Transformer have been trained to perform inference from MSAs of evolutionarily related sequences. The MSA Transformer interleaves per-sequence (“row”) attention with per-site (“column”) attention to incorporate epistasis. Epistasis leads to co-evolution of certain protein positions: the effect of a mutation at one site depends on the presence or absence of mutations at other sites. Combinations of row attention heads in the MSA Transformer have led to state-of-the-art unsupervised structural contact predictions.

End-to-end deep learning approaches for variant effect prediction are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (see Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity, with data augmentation using cross-species information. In particular, PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach, which utilizes protein sequences for pathogenicity prediction, is promising because it can avoid the circularity problem and overfitting to previous knowledge. Compared to the amount of data needed to train deep neural networks effectively, the amount of clinical data available in ClinVar is relatively small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while mutation rate-matched samples of unlabelled data, based on trinucleotide context, are used as unknown data.

An opportunity arises to use protein language models and MSAs for variant pathogenicity prediction. More accurate variant pathogenicity prediction may result.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:

FIG. 1 is a high-level diagram that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA and processing the masked MSA through the disclosed PrimateAI language model to produce a phenotype prediction.

FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid to an MSA and generating the disclosed partially-masked MSA.

FIG. 3 shows one implementation of one-hot tokens that are defined for the twenty residue one-hot vectors, the gap residue one-hot vector, and the mask one-hot vector.

FIG. 4 illustrates one implementation of channel embeddings that are defined for the twenty residue channel embedding sets, the gap channel embedding set, and the mask channel embedding set.

FIG. 5 shows cropping, padding, and masking of MSAs in accordance with various implementations of the technology disclosed.

FIG. 6 depicts one implementation of generating the disclosed MSA representation.

FIG. 7 illustrates an example architecture of the disclosed PrimateAI language model.

FIG. 8 shows details of the disclosed mask revelation.

FIG. 9 shows various components of the PrimateAI language model.

FIG. 10 shows one implementation of the disclosed revelation output head used by the disclosed PrimateAI language model.

FIG. 11 is a computer-implemented method of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.

FIG. 12 is a system that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.

FIG. 13 shows the performance evaluation of the language modelling part of the disclosed PrimateAI language model compared with other language models.

FIG. 14 depicts the Top-1 training accuracy of the disclosed PrimateAI language model.

FIG. 15 is a computer system that can be used for compilation and runtime execution of the disclosed PrimateAI language model.

FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs.

FIG. 17 illustrates the training of the PrimateAI language model using the LAMB optimizer with gradient pre-normalization.

DETAILED DESCRIPTION

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.

The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel, or operated in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.

INTRODUCTION

The disclosed PrimateAI language model uses a masked language modeling objective for training on sequences. During training, residues at different positions in a sequence are replaced with a mask token, and the PrimateAI language model is trained to predict the original residues at those positions.

Masked language modeling allows training on a large amount of unlabelled data. Fill-in-the-blank multiple sequence alignment (MSA) Transformers simultaneously classify multiple masked locations in MSAs during training. Higher numbers of mask locations can add more masked language modelling (MLM) gradients that inform optimization, thereby enabling a higher learning rate and faster training.

However, fill-in-the-blank pathogenicity prediction is fundamentally different from traditional MLM, as classification at a mask location depends on the predicted values of residues at other mask locations. The classification scores may often be the averages of conditional predictions over all possible combinations of residues at other mask locations.

The PrimateAI language model avoids this averaging by revealing masked tokens at other mask locations before making predictions. The PrimateAI language model achieves state-of-the-art clinical performance and denoising accuracy whilst requiring 50× less computation for training than previous MSA Transformers. Various aspects of the technology disclosed, discussed later, contribute to the 50× reduction in training compute. Examples of such aspects include the periodically-spaced mask grid, mask revelation, and the architecture of the PrimateAI language model.

The PrimateAI language model can be considered an MSA Transformer for fill-in-the-blank residue classification. In one implementation, the PrimateAI language model is trained end-to-end on MSAs of UniRef50 proteins to minimize an unsupervised MLM objective. The PrimateAI language model outputs classification scores for alternative and reference residues, which serve as inputs to the PrimateAI three-dimensional (3D) rank loss.

Phenotype Prediction

FIG. 1 is a high-level diagram 100 that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA 140 and processing the masked MSA 140 through the disclosed PrimateAI language model (i.e., a phenotype predictor 150 or pathogenicity language model) to produce a phenotype prediction 160.

In one implementation, an MSA dataset 110 includes a multiple sequence alignment (MSA) 120 for each sequence in a UniRef50 database that is retrieved by searching a UniClust30 database. The MSA 120 is an alignment of multiple homologous protein sequences to a target protein. From the MSA 120, the degree of homology can be inferred and the evolutionary relationships among the sequences studied. Since real protein sequences are likely to have insertions, deletions, and substitutions, the sequences are aligned by minimizing a Levenshtein distance-like metric over all the sequences. In some implementations, heuristic alignment schemes are used. For example, tools like JackHMMER and HHblits can increase the number and diversity of sequences returned by iteratively performing the search and alignment steps.

It is difficult to incorporate nearby evolution because mutational differences in creatures with a recent ancestor are significantly influenced by the electromechanical susceptibilities of proteins to mutations. To avoid this, the MSAs used by the technology disclosed contain diverse proteins that align with the query sequence. Using diverse sequences from many species reduces the influence of electromechanical susceptibility on predictions, as the differences are more strongly determined by natural selection.

In some implementations, the MSA dataset 110 can contain twenty-six million MSAs that are created by using the protein homology detection software HHblits. In other implementations, an additional set of MSAs can be generated for 19,071 human proteins using HHblits. A person skilled in the art will appreciate that the technology disclosed can search, generate, and otherwise leverage (or use) any number of MSAs.

In some implementations, those UniRef50 MSAs whose query sequences carry rare amino acids can be excluded from the MSA dataset 110, thereby retaining only those MSAs in the MSA dataset 110 that contain the twenty most abundant residues. In other implementations, only those non-query sequences can be included in the MSAs that contain the twenty most common residues and gaps, which in turn represent deletions relative to the query sequence.

In some implementations, the MSAs that are provided as inputs to the PrimateAI language model can have a fixed size of 1024 sequences. Of the 1024 sequences, up to 1023 non-query sequences can be randomly sampled from the filtered sequences if the MSA depth is larger than 1024. If the MSA depth is less than 1024, the MSA can be padded with zeros to fill the input. The MSA depth refers to the number of protein sequences in the MSA. For example, the MSA transformer can be trained with a fixed input MSA depth of 1024 sequences. This simplifies processing because the tensors input to the model have a fixed shape. If the full MSA depth is less than 1024, padding can be added to increase its size to 1024. If the full MSA depth is more than 1024, 1023 sequences can be randomly sampled from the full MSA. The one query sequence is kept such that the resulting MSA has a depth of 1024 (1023 randomly sampled sequences and 1 query sequence).
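For illustration, this depth-fixing scheme can be sketched as follows. This is a minimal sketch assuming the MSA is held as a list of equal-length aligned strings with the query sequence first; the function name and the pad representation are illustrative, not taken from the disclosure.

```python
import random

def fix_msa_depth(msa, depth=1024, pad_symbol="0"):
    """Pad or subsample an MSA (query sequence first) to a fixed depth.

    msa: list of aligned, equal-length residue strings; msa[0] is the query.
    Returns a list of exactly `depth` sequences.
    """
    if len(msa) >= depth:
        # Keep the query and randomly sample 1023 non-query sequences.
        sampled = random.sample(msa[1:], depth - 1)
        return [msa[0]] + sampled
    # Otherwise pad with all-zero rows so input tensors have a fixed shape.
    width = len(msa[0])
    padding = [pad_symbol * width] * (depth - len(msa))
    return msa + padding
```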

A masking logic 130 can apply one or more masks to the MSA 120 and generate a masked MSA 140. The masks can be arranged in a periodic manner, non-periodic manner, regular manner, or irregular manner. The masks are not limited to periodically-spaced masks or a regular grid or array of masks. The masks can be irregular in shape, can be straight or curved, and can be arranged in irregular, non-evenly spaced patterns. The masks are regular in shape when the distance between adjacent masks is fixed or the same. The masks are irregular in shape when the distance between adjacent masks varies.

The phenotype predictor 150 (e.g., the PrimateAI language model) can process the masked MSA 140 and generate the phenotype prediction 160. In one implementation, the phenotype prediction 160 outputs the identity of the masked residues in the masked MSA 140. In other implementations, the phenotype prediction 160 can be used for variant pathogenicity prediction, protein contact map generation, protein functionality prediction, and so on.

Note that portions of this Application refer to a protein as a “sequence,” “residue sequence,” “amino acid sequence,” and “chain of amino acids” interchangeably. Also, note that portions of this Application use “amino acids” and “residues” interchangeably. Further note that portions of this Application use “a set of periodically-spaced masks,” “periodically-spaced masks,” “mask grid,” “periodically-spaced mask grid,” “periodic mask pattern,” and “fixed mask pattern” interchangeably.

The sequences shown in the figures are protein sequences comprising amino acid residues. In other implementations, the sequences can instead comprise DNA, RNA, carbohydrates, lipids, or any other straight or branched biopolymer.

Having described the technology disclosed at a high level using FIG. 1, the discussion now turns to the disclosed periodically-spaced mask grid, a particular implementation of the masking logic 130.

Periodically-Spaced Mask Grid

FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid 210 to an MSA 220 and generating the disclosed partially-masked MSA 230.

The columns of the periodically-spaced mask grid 210 correspond to residue positions. The residue positions are also referred to herein as ordinal positions. For example, in FIG. 2, the periodically-spaced mask grid 210 has nine columns corresponding to nine residue positions (i.e., r=9).

The periodically-spaced mask grid 210 has elements (or units or tokens) that are masks. In FIG. 2, such mask elements are depicted by boxes with black fill and a “?” symbol. The periodically-spaced mask grid 210 also has elements (or units or tokens) that are not masks. In FIG. 2, such non-mask elements are depicted by boxes with yellow fill.

The rows of the periodically-spaced mask grid 210 include elements that are masks and elements that are not masks. The rows of the periodically-spaced mask grid 210 are referred to herein as mask distributions. For example, in FIG. 2, there are five mask distributions 1-5 (i.e., m mask distributions, where m=5).

Each mask distribution has k periodically-spaced masks. For example, in FIG. 2, mask distributions 1-4 each have three masks (i.e., k=3), and mask distribution 5 has two masks (i.e., k=2).

The k periodically-spaced masks in a mask distribution are at k ordinal positions that begin at varying offsets from a first residue position in the periodically-spaced mask grid 210. For example, in FIG. 2, the k periodically-spaced masks of the first mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the second mask distribution are located at the first, the fourth, and the seventh ordinal positions, and begin at an offset of zero from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the third mask distribution are located at the second, the fifth, and the eighth ordinal positions, and begin at an offset of one from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fourth mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fifth mask distribution are located at the fourth and the seventh ordinal positions, and begin at an offset of three from the first residue position in the periodically-spaced mask grid 210.

Masks in the periodically-spaced mask grid 210 are periodic because the masks have regular spacing between them and repeat at regular intervals, i.e., the masks are regularly-spaced repeats. The masks in the periodically-spaced mask grid 210 are also periodic because the masks have an ordered pattern.

The masks in the periodically-spaced mask grid 210 can have a lattice pattern, a diagonal pattern, a hexagonal pattern, a diamond pattern, a rectangle pattern, a square pattern, a triangle pattern, a convex pattern, a concave pattern, and/or a polygonal pattern.

In one implementation, the k periodically-spaced masks of each of the mask distributions in the periodically-spaced mask grid 210 have the same stride (e.g., stride=3 in FIG. 2). In another implementation, the k periodically-spaced masks across the mask distributions in the periodically-spaced mask grid 210 have a diagonal pattern. In other implementations, the stride can be any number, such as 16, or any number in a range of 8 to 64 or any subrange thereof. As used herein, the term “stride” refers to the distance between adjacent masks.

In other implementations, the masks in the periodically-spaced mask grid 210 are quasi-periodic, such that the masks have an ordered pattern, but the masks do not recur at precisely regular intervals.
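For illustration, a periodic mask grid of the kind described above can be sketched as a Boolean array built from a stride and per-row offsets; the function name and the default offsets are illustrative assumptions.

```python
import numpy as np

def periodic_mask_grid(m, r, stride=3, offsets=None):
    """Boolean grid of shape (m, r) in which True marks a mask element.

    Each of the m mask distributions places masks every `stride` columns,
    starting at its own offset from the first residue position.
    """
    if offsets is None:
        # Cycling the offset by one per row yields a diagonal pattern.
        offsets = [row % stride for row in range(m)]
    grid = np.zeros((m, r), dtype=bool)
    for row, offset in enumerate(offsets):
        grid[row, offset::stride] = True
    return grid
```

Calling periodic_mask_grid(5, 9, stride=3, offsets=[2, 0, 1, 2, 3]) reproduces the placement of FIG. 2, including the fifth mask distribution with only two masks.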

The discussion now turns to FIGS. 3 and 4 to discuss the details of how the masks are encoded for processing by the PrimateAI language model. After having described FIGS. 3 and 4, the discussion will return to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.

Masks

A mask token defines the masks. The mask token is configured to conceal or replace the original residue in an MSA onto which the mask token is applied. The mask token is a special or auxiliary token in the sense that the mask token is different from the twenty residue tokens that are used to define the twenty naturally-occurring residues. The mask token is also different from the gap residue token that is used to define the gap residue. The gap residues are those residues whose identities are unresolved (or unknown), and therefore the gap residues are not reliably classified to any of the twenty-one known residues. The gap residues are encoded by the gap residue token.

The mask token can be defined by the same encoding logic that defines the twenty residue tokens and the gap residue token, in a way that encodes the mask token as the twenty-second residue.

FIG. 3 shows one implementation of one-hot tokens 300 that are defined for the twenty residue one-hot vectors 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, and 320; the gap residue one-hot vector 321; and the mask one-hot vector 322. The one-hot tokens 300 are encoded with a binary vector of twenty-two bits, with one of the bits being hot (i.e., 1) while the others are 0. In some implementations, a one-hot encoder (not depicted) generates the one-hot tokens 300.
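For illustration, such a twenty-two-bit one-hot encoding can be sketched as below; the token ordering and the “-” (gap) and “?” (mask) symbols are illustrative assumptions, not the encoding of FIG. 3.

```python
import numpy as np

# Vocabulary: twenty residue tokens, the gap residue token, and the mask token.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY") + ["-", "?"]

def one_hot(token):
    """Return the twenty-two-bit one-hot vector for a residue, gap, or mask token."""
    vector = np.zeros(len(VOCAB), dtype=np.uint8)
    vector[VOCAB.index(token)] = 1  # exactly one bit is hot
    return vector
```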

FIG. 4 illustrates one implementation of channel embeddings 400 (or learned embeddings) that are defined for the twenty residue channel embedding sets 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, and 420; the gap channel embedding set 421; and the mask channel embedding set 422. The channel embeddings 400 span the twenty-one known residues. The gap channel embedding set 421 spans the gap residues. The mask channel embedding set 422 spans the mask residues. The channel embeddings 400 are tensors that have a height dimension, a width dimension, and a depth dimension, and each set of channel embeddings can include N channel embeddings, where N is an integer like ninety-four. In some implementations, an embeddings generator (not depicted), e.g., a multi-layer perceptron, generates the channel embeddings 400.

In some implementations, the embeddings generator can be trained in conjunction with the PrimateAI language model to learn and generate the channel embeddings 400. During inference, a lookup table can store a mapping between the one-hot tokens 300 and the channel embeddings 400. The lookup table can be accessed during the inference to replace the residue tokens, the gap token, and the mask token with the corresponding channel embeddings.

In other implementations, the encoding of the mask token (e.g., one-hot or channel embeddings) can vary depending on a variety of factors. Examples include the location (i.e., residue position) of the mask, the residue-type on which the mask is applied, the sequence-type on which the mask is applied, the sequence number on which the mask is applied, and the species-type of the sequence on which the mask is applied.

In other implementations, the mask token can be encoded using other schemes. Examples include quantitative or numerical data types, qualitative data types, discrete data types, continuous data types (with lower and upper bounds), integer data types (with lower and upper bounds), nominal data types, ordinal or ranked data types, categorical data types, interval data types, and ratio data types. For example, the encoding can be based on any of the following, or any combination thereof: multiple bits, real values between 0 and 1, continuous values such as floating point numbers, Red, Green, Blue (RGB) values between 0 and 256, hexadecimal values of CSS colors (e.g., #F0F8FF), categorical color values of CSS colors, respective values of other CSS property groups and properties, size of a particular dimension (e.g., height and width), a set of different values and data types, and others.

The discussion now returns to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.

Partially-Masked MSA

The MSA 220 has p rows and r columns. The p rows correspond to p protein sequences. The r columns correspond to r residue positions (e.g., r=16 in FIG. 2). The periodically-spaced mask grid 210 can have a different number of rows and columns (i.e., a different shape) than the MSA 220. In some implementations, the periodically-spaced mask grid 210 can have the same number of rows and columns (i.e., the same shape) as the MSA 220.

The periodically-spaced mask grid 210 can be applied 212 (or overlaid) anywhere on the MSA 220. For example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is centered at a particular column of the MSA 220 that contains a residue-of-interest 214 (in red) at a position-of-interest 216 (in red). In another example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is placed at a particular row (e.g., the query sequence, like sequence one in FIG. 2) of the MSA 220 that contains the residue-of-interest 214 at the position-of-interest 216.

In one implementation, the periodically-spaced mask grid 210 is applied to a subset of sequences in the MSA 220, spanning a window of sequences 222 (e.g., five sequences in FIG. 2). In some implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 in a left-flanking manner or a right-flanking manner. In other implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 on a portion-by-portion basis, traversing portions (e.g., quadrants) of the MSA 220 simultaneously or sequentially.

Those residues of the MSA 220 onto which the non-mask elements of the periodically-spaced mask grid 210 are overlaid remain unchanged and are referred to herein as the unmasked residues. Conversely, those residues of the MSA 220 onto which the mask elements of the periodically-spaced mask grid 210 are overlaid change to the mask token and are referred to herein as the masked residues.

A combination or aggregation of the unmasked residues and the masked residues forms the partially-masked MSA 230. The partially-masked MSA 230 can be defined as an MSA that includes some residues that are not masked (unmasked) and some residues that are masked. The partially-masked MSA 230 can also be defined as an MSA that includes some sequences that contain masked residues and some sequences that do not contain any masked residues.

A portion (or patch) of the partially-masked MSA 230 can be cropped (or selected or extracted) to generate a cropped portion 232 (in blue, dashed outline in FIG. 2). In some implementations, the cropped portion 232 can include: (i) the masked residues in the window of sequences 222, (ii) some unmasked residues that are contiguously adjacent to the masked residues within a neighborhood that coincides with (or defines) a boundary of the cropped portion 232, and (iii) portions of some additional sequences that extend beyond the window of sequences 222 and do not contain any masked residues.

MSA Cropping, Padding, and Masking

FIG. 5 shows cropping, padding, and masking of MSAs 500 in accordance with various implementations of the technology disclosed. In FIG. 5, a residue-of-interest at a position-of-interest in the query sequence is indicated by an X, mask locations are indicated by black fill, padding is indicated by grey fill, and crop regions are indicated by red, dashed lines. In these examples, mask stride is three and cropping window width is six residues.

In panel A, away from the MSA edges, the position-of-interest is at the right side of the center of a crop region. In panel B, a crop region is shifted to the right of the position-of-interest to avoid going over an MSA edge. In panel C, an MSA for a short protein is padded to fill a crop region. In panel D, a crop region is shifted to the right of the position-of-interest to minimize padding and the MSA is padded to fill the crop region.

In some implementations, the position-of-interest is randomly sampled from positions in the query sequence during training or chosen by a user during inference. To maximize information about the position-of-interest, in some implementations, a cropping window is selected with a size of 256 residues such that the position-of-interest is at the center. However, the cropping window can be shifted if the position-of-interest is near the edge of an MSA to avoid padding zeros and to increase information about the position-of-interest. If the query sequence is shorter than the cropping window, zeros can be padded to fill the window size.
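For illustration, this window-selection rule can be sketched as the following hypothetical helper, which centers the position-of-interest when possible and shifts the window away from MSA edges; the function itself is illustrative.

```python
def crop_window_start(pos, seq_len, window=256):
    """Start column of a crop window that centers `pos` when possible."""
    start = pos - window // 2                      # center the position-of-interest
    start = max(start, 0)                          # shift right at the left MSA edge
    start = min(start, max(seq_len - window, 0))   # shift left at the right MSA edge
    return start                                   # pad beyond seq_len if query is short
```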

In some implementations, a smaller probability, p_(sample), is assigned to an MSA being sampled during training if the protein length, L, of the query sequence is short, for example,

$p_{sample} \propto \frac{\max(\min(L, 512), 64)}{512}.$

This assignment rebalances the distribution of lengths for UniRef50 proteins used for training and for human proteins, and also prevents wastage of computation on padding.

The UniRef50 proteins used for training often have short sequences, whereas a majority of human proteins have long sequences. FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs. Many of the proteins in the UniRef50 HHblits MSAs have a short sequence, while only a few human proteins among the MSAs are short. Accordingly, the sampling of longer UniRef50 proteins during training can be increased, such that the sampled distribution of short and long proteins is closer to the distribution of human proteins. Increasing the sampling of long-sequence UniRef50 proteins also increases computation efficiency. When only using short-sequence UniRef50 proteins as input, the input will be padded up to a fixed input shape, which means that the computation during the training process would be wasted on padding rather than adding gradients to the model optimization.

The probability of sampling non-query sequences to be included in the first f sequences of an MSA can also be adjusted (e.g., f=32). In one implementation, the periodically-spaced mask grid 210 is applied in a way that penalizes the occurrences of gaps in the first f sequences. The probability, p_(mask), of a non-query sequence being masked decreases with an increasing number of gap tokens, N_(gap), for example,

$p_{mask} \propto \frac{(L - N_{gap})^{2}}{L^{2}}.$

Downsampling of sequences with a considerable number of gaps reduces the fraction of missing data in the MSAs.
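For illustration, the two proportionality relations above can be transcribed directly; the sketch below returns unnormalized sampling weights, with function names chosen for illustration.

```python
def p_sample(L):
    """Relative probability of sampling an MSA whose protein has length L."""
    return max(min(L, 512), 64) / 512.0

def p_mask(L, n_gap):
    """Relative probability of masking a non-query sequence with n_gap gap tokens."""
    return (L - n_gap) ** 2 / L ** 2
```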

MSA Representation

FIG. 6 depicts one implementation of generating 600 the disclosed MSA representation. Panel A shows the MSA 220. Panel B shows the partially-masked MSA 230. In this example, the periodically-spaced mask grid 210 is applied to the first four sequences of the MSA 220 and has a stride of three. The partially-masked MSA 230 is generated as a result of applying the periodically-spaced mask grid 210 to the MSA 220. In panel C, the unmasked residues and the masked residues in the partially-masked MSA 230 are replaced with corresponding ones of the channel embeddings 400. In one implementation, the corresponding ones of the channel embeddings 400 are summed with position embeddings for residue columns. The position embeddings can be learned and generated during the training of the PrimateAI language model. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is divided into chunks 640. In panel D, the chunks 640 are concatenated in the channel dimension into a stack 660 and then linearly projected 670 to form an MSA representation 680. In some implementations, the linear projection 670 uses a plurality of one-dimensional (1D) convolution filters.

The channel embeddings 400 are also referred to herein as learned embeddings. In one implementation, the masked residues and the unmasked residues in the partially-masked MSA 230 are translated into the learned embeddings by using a look-up table that stores learned embeddings corresponding to the masked residues and the unmasked residues.

The position embeddings are also referred to herein as residue position embeddings. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is also referred to herein as an embedded representation of the partially-masked MSA 230. The learned embeddings are combined with the residue position embeddings to generate the embedded representation.

The embedded representation is chunked into the series of chunks 640. The chunks in the series of chunks are concatenated into the stack 660.

The MSA representation 680 is also referred to herein as a projected (or compressed) representation of the embedded representation. The projected representation has m rows and r columns. The stack 660 is translated into the projected representation by using convolution operations, in accordance with one implementation. Note that the projected representation is not compressed at this stage in the sense of making the data smaller. The projected representation is “compressed” or “smaller” only in comparison to what the embedded representation would be without row stacking, which is why row stacking lowers computational requirements. However, the projected representation is not smaller than the model input in terms of feature dimensionality.

In one implementation, the fixed mask pattern is applied to the first thirty-two sequences of MSAs. The MSA tokens are encoded by learned 96-channel embeddings, which are summed with learned 96-channel position embeddings for residue columns before layer normalization. To reduce computational requirements, embeddings for the 1024 sequences in MSAs are split into thirty-two chunks, each containing thirty-two sequences, at periodic intervals along the sequence axis. These chunks are then concatenated in the channel dimension and mixed by linear projection. In the context of this application, chunks can be referred to as different non-overlapping rows of the MSA. In other implementations, the MSA can be “chunked” in other ways, such as column-wise, or some other irregular pattern.
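For illustration, a shape-level sketch of this chunk-stack-project step in PyTorch follows, using the dimensions given above (1024 sequences, 256 residue columns, 96 channels, thirty-two chunks of thirty-two sequences). Reading the chunks as contiguous blocks of thirty-two sequences, with the first chunk holding the masked window, is one plausible interpretation; the use of nn.Linear for the mixing projection is likewise an assumption.

```python
import torch
import torch.nn as nn

s, r, c = 1024, 256, 96                  # sequences, residue columns, channels
n_chunks = 32                            # 32 chunks of 32 sequences each
embedded = torch.randn(s, r, c)          # token embeddings + position embeddings

# Split the sequence axis into chunks; chunk 0 holds the first 32 sequences,
# i.e., the masked window.
chunks = embedded.reshape(n_chunks, s // n_chunks, r, c)   # (32, 32, 256, 96)

# Concatenate the chunks in the channel dimension: row i of the stack pairs
# masked sequence i with one sequence drawn from each of the other chunks.
stack = chunks.permute(1, 2, 0, 3).reshape(s // n_chunks, r, n_chunks * c)  # (32, 256, 3072)

# Mix the stacked channels by a learned linear projection down to c_MSA = 768.
project = nn.Linear(n_chunks * c, 768)
msa_representation = project(stack)      # (32, 256, 768): m rows, r columns
```

A position-wise linear projection of this kind is equivalent to the 1D convolution filters mentioned for the projection 670 when the convolution kernel width is one.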

PrimateAI Language Model

FIG. 7 illustrates an example architecture 700 of the PrimateAI language model. The PrimateAI language model comprises a cascade of axial-attention blocks 710 (e.g., twelve axial-attention blocks). The cascade of axial-attention blocks 710 takes the MSA representation 680 as input and generates an updated MSA representation 720 as output. Each axial-attention block comprises residuals that add a tied row-wise gated self-attention layer 712, a tied column-wise gated self-attention layer 714, and a transition layer 716.

In one implementation, there are twelve heads in the tied row-wise gated self-attention layer 712. In one implementation, there are twelve heads in the tied column-wise gated self-attention layer 714. Each head generates sixty-four channels, totaling 768 channels across twelve heads. In one implementation, the transition layer 716 projects up to 3072 channels for GELU activation.

The technology disclosed modified axial-gated self-attention to include tied attention, instead of triangle attention. Triangle attention has a high computation cost. Tied attention is the sum of dot-product affinities, between queries and keys, across non-padding rows, followed by division by the square root of the number of non-padding rows, which reduces the computational burden substantially.

The discussion now turns to the disclosed mask revelation.

Mask Revelation

The mask revelation reveals unknown values at other mask locations after the cascade of axial-attention blocks 710. The mask revelation gathers features aligned with mask sites. For each masked residue in a row, the mask revelation reveals embedded target tokens at other masked locations in that row.

The mask revelation combines the updated 768-channel MSA representation 720 with 96-channel target token embeddings 690 at locations indicated by a Boolean mask 770 which labels positions of mask tokens. The Boolean mask 770, which is a fixed mask pattern with stride 16, is applied row-wise to gather features from the MSA representation and target token embedding at mask token locations.

Feature gathering reduces row length from 256 to 16, which drastically decreases the computational cost of the attention blocks that follow mask revelation. For each location in each row of the gathered MSA representation, the row is concatenated with a corresponding row from the gathered target token embedding where that location is also masked in the target token embedding. The MSA representation and partially revealed target embedding are concatenated in the channel dimension and mixed by a linear projection.

After mask revelation 730, the now-informed MSA representation 740 is propagated through residual row-wise gated self-attention layers 750, 756 and a transition layer 754. The attention is only applied to features at mask locations, as residues are known for other positions from the MSA representation 680 provided as input to the PrimateAI language model. Thus, attention only needs to be applied at mask locations where there is new information from mask revelation.

After interpretation of the mask revelations by self-attention, a masked gather operation 760 collects features from the resulting MSA representation at positions where target token embeddings remained masked. The gathered MSA representation 772 is translated to predictions 790 for 21 candidates in the amino acid and gap token vocabulary by an output head 780. The output head 780 comprises a transition layer and a perceptron.

FIG. 8 shows details 800 of the disclosed mask revelation. Mask revelation makes more information available during subsequent training, improving the accuracy of predicting each residue-of-interest.

The first step is to gather 804, 830, 862 all the tokens at the mask locations 802, 860 marked by the dots. The term gather is used here interchangeably with the term aggregate. This is done for tokens in the updated MSA representation 720, the periodically-spaced mask grid 210, and the embedded representation (embedding tokens) 690.

In FIG. 8, the dashed lines and colors show how an MSA tile 806 and an embedding tile 844 are selected. Feature gathering reduces row length from 256 to 16 (6 to 2 in FIG. 8), which drastically decreases the computational cost of the attention blocks that follow mask revelation. Each of the gathered representations is tiled or replicated/cloned 808, 830, 866 by the number of masks in the rows. In the example shown in FIG. 8, there are two masks per row. Therefore, there are two tiles that are concatenated as clones 810 and 870 as a result of cloning 808 and 866, respectively.

Mask revelation 830 is the removal of all the masks in a tile except for those at a single position. The top tile of the gathered masks is masked at the first position-of-interest 834 and unmasked at all the other positions-of-interest 836. The second tile is masked at the second position-of-interest 838 and unmasked at all the other positions-of-interest 832. Mask revelation reveals the other tokens in a row for each masked position in the row. In some implementations, positions are masked in the same way in both training and inference. This results in higher performance than changing to only masking the position-of-interest during inference. The position of the location-of-interest in the input is chosen to maximize input information because, for example, when the location-of-interest is centered at the mask, more of the flanking columns of the MSA are included in the input that is processed by the PrimateAI language model.

Next, the remaining masks after mask revelation 830 are applied 868 to the embedding tile 844 to produce cloned and masked embedding tiles 870. The cloned and masked embedding tiles 870 are concatenated 872 with the cloned MSA tiles 810 to generate concatenated tiles 873. The concatenated tiles 873 are linearly projected 874 to produce the informed MSA representation 740.
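For illustration, the gather, clone, reveal, and concatenate steps can be sketched at the shape level in NumPy, using the dimensions given earlier (thirty-two rows, row length 256, sixteen masks per row); the array names and the zero-vector stand-in for the mask token embedding are illustrative.

```python
import numpy as np

m, r, k = 32, 256, 16                      # rows, row length, masks per row
c_msa, c_emb = 768, 96
updated = np.random.randn(m, r, c_msa)     # updated MSA representation 720
embedded = np.random.randn(m, r, c_emb)    # embedded target tokens 690
mask_cols = np.arange(0, r, r // k)        # periodic mask locations, stride 16

# Gather features at the k mask locations: row length drops from 256 to 16.
msa_gathered = updated[:, mask_cols]       # (m, k, c_msa)
emb_gathered = embedded[:, mask_cols]      # (m, k, c_emb)

# Clone each gathered representation k times, one tile per mask position.
msa_tiles = np.broadcast_to(msa_gathered, (k, m, k, c_msa)).copy()
emb_tiles = np.broadcast_to(emb_gathered, (k, m, k, c_emb)).copy()

# Mask revelation: in tile i, keep only column i concealed and reveal the
# target embeddings at every other mask position.
mask_token_embedding = np.zeros(c_emb)     # stand-in for the mask token embedding
for i in range(k):
    emb_tiles[i, :, i, :] = mask_token_embedding

# Concatenate in the channel dimension; a learned linear projection (omitted
# here) would then mix the channels into the informed MSA representation 740.
concatenated = np.concatenate([msa_tiles, emb_tiles], axis=-1)  # (k, m, k, 864)
```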

PrimateAI Language Model Components & Training

FIG. 9 shows various components 900 of the PrimateAI language model, in accordance with one implementation. The components can include tied row-wise gated self-attention, row-wise gated self-attention, and column-wise gated self-attention. The PrimateAI language model can also use tied attention. Axial-attention creates independent attention maps for each row and column of the input. Sequences in an MSA usually have similar three-dimensional structures. Direct coupling analysis exploits this fact to learn structural contact information. To leverage this shared structure, it is beneficial to tie the row attention maps between the sequences in the MSA. As an additional benefit, tied attention reduces the memory footprint of the row attentions.

In implementations involving recomputation, tied attention reduces the memory footprint of the row attentions from O(ML²) to O(L²). Let M be the number of rows, d be the hidden dimension, and Q_(m), K_(m) be the matrices of queries and keys for the m-th row of input. Tied row attention is defined, before softmax is applied, to be:

$\frac{\sum_{m=1}^{M} Q_{m} K_{m}^{T}}{\lambda(M, d)}$

The final model uses square root normalization. In other implementations, the model can also use mean normalization. The denominator λ(M, d) generalizes the normalization constant √d used in standard scaled dot-product attention. For tied row attention, two normalization functions are used to prevent attention weights from scaling linearly with the number of input sequences: λ(M, d)=M√d (mean normalization) and λ(M, d)=√(Md) (square root normalization).
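For illustration, tied row attention with the two normalizations above can be sketched in PyTorch as follows; gating, multiple attention heads, and the restriction to non-padding rows are omitted.

```python
import torch

def tied_row_attention(q, k, v, norm="sqrt"):
    """Tied row attention: affinities are summed over the M rows before softmax.

    q, k, v: tensors of shape (M, L, d) for one MSA with M rows of length L.
    """
    M, L, d = q.shape
    affinities = torch.einsum("mid,mjd->ij", q, k)             # sum over m of Q_m K_m^T
    lam = M * d ** 0.5 if norm == "mean" else (M * d) ** 0.5   # λ(M, d)
    weights = torch.softmax(affinities / lam, dim=-1)          # one map shared by all rows
    return torch.einsum("ij,mjd->mid", weights, v)             # apply the tied map per row
```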

In FIG. 9, dimensions are shown for sequences, s=32; residues, r=256; attention heads, h=12; and channels, c=64 and c_(MSA)=768.

In one implementation, the PrimateAI language model can be trained on four A100 graphics processing units (GPUs). Optimizer steps are for a batch size of 80 MSAs, which is split over four gradient aggregations to fit batches into 40 GB of A100 memory. The PrimateAI language model is trained with the LAMB optimizer using the following parameters: β_1=0.9, β_2=0.999, ε=10⁻⁶, and weight decay of 0.01. Gradients are pre-normalized by division by their global L2 norm before applying the LAMB optimizer. Training is regularized by dropout with probability 0.1, which is applied after activation and before residual connections.
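For illustration, the gradient pre-normalization step can be sketched as below. LAMB is not part of core PyTorch, so the optimizer is assumed to come from a separate implementation constructed with the parameters listed above; the helper function is illustrative.

```python
import torch

def prenormalize_gradients(model):
    """Divide all gradients by their global L2 norm before the optimizer step."""
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    for g in grads:
        g.div_(global_norm)

# Training-loop fragment, assuming `optimizer` is a LAMB implementation built
# with betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01:
#     loss.backward()
#     prenormalize_gradients(model)
#     optimizer.step()
#     optimizer.zero_grad()
```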

FIG. 17 illustrates the training of the PrimateAI language model using the LAMB optimizer with gradient pre-normalization. Residual blocks are started as identity operations, which speeds up convergence and enables the PrimateAI language model. “AdamW” refers to the ADAM optimizer with weight decay, “ReZeRO” refers to the Zero Redundancy Optimizer, and “LR” refers to the LAMB optimizer with gradient pre-normalization. See Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, Yang You, Jing Li, Sashank Reddi, et al., International Conference on Learning Representations (ICLR) 2020. As illustrated, the LAMB optimizer with gradient pre-normalization shows better performance (e.g., a higher accuracy rate over fewer training iterations) and is more effective for a range of learning rates compared to the use of the AdamW optimizer and the Zero Redundancy Optimizer.

Axial dropout can be applied in self-attention blocks before residual connections. Post-softmax spatial gating in column-wise attention is followed by column-wise dropout, while post-softmax spatial gating in row-wise attention is followed by row-wise dropout. The post-softmax spatial gating allows for modulation of the exponentially normalized scores or probabilities produced by the softmax.

In one implementation, the PrimateAI language model can be trained for 100,000 parameter updates. The learning rate is linearly increased over the first 5,000 steps from η=5×10⁻⁶ to a peak value of η=5×10⁻⁴, and then linearly decayed to η=10⁻⁴. Automatic mixed precision (AMP) can be applied to cast suitable operations from 32-bit to 16-bit precision during training and inference. This increases throughput and reduces memory consumption without affecting performance. In addition, a Zero Redundancy Optimizer can reduce memory usage by sharding optimizer states across multiple GPUs.
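For illustration, the warm-up and decay schedule above can be transcribed as the following sketch (the function name is illustrative).

```python
def learning_rate(step, warmup=5_000, total=100_000,
                  lr_init=5e-6, lr_peak=5e-4, lr_final=1e-4):
    """Linear warm-up to the peak learning rate, then linear decay."""
    if step < warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    return lr_peak + (lr_final - lr_peak) * (step - warmup) / (total - warmup)
```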

Revelation Output Head

FIG. 10 shows one implementation of the revelation output head 780 that can be used by the disclosed PrimateAI language model. The gathered MSA representation 772 can be translated by the output head 780 to predictions 790 for 21 candidates in an amino acid vocabulary including a gap token. In one implementation, an amino acid vocabulary can be enumerated and the amino acid enumerations are used to index a dictionary of learned embeddings. In other implementations, one-hot embeddings of amino acids can be used and combined with linear projections. In some implementations, the revelation output head 780 can comprise a transition layer 1002, a gate 1004, a layer normalization block 1006, a linear block 1008, a GELU block, and another linear block 1012. Dimensions are shown for channels, c_(MSA)=768, and vocabulary size, v=21.
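For illustration, the tail of this head can be sketched in PyTorch as follows. The transition layer 1002 and gate 1004 are model-specific and only indicated by a comment, and the hidden width of the intermediate linear block is an assumption.

```python
import torch.nn as nn

c_msa, vocab = 768, 21   # channels in; twenty amino acids plus the gap token out

# Transition layer 1002 and gate 1004 would precede the layers below.
revelation_head_tail = nn.Sequential(
    nn.LayerNorm(c_msa),        # layer normalization block 1006
    nn.Linear(c_msa, c_msa),    # linear block 1008 (hidden width assumed)
    nn.GELU(),                  # GELU block
    nn.Linear(c_msa, vocab),    # linear block 1012: logits for the 21 candidates
)
```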

Method

FIG. 11 is a computer-implemented method 1100 of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.

At action 1102, a multiple sequence alignment (MSA) 220 can be accessed. The MSA can have p rows and r columns. The p rows can correspond to p protein sequences. The r columns can correspond to r residue positions.

At action 1104, a mask grid 210 can be accessed. The mask grid 210 can have m mask distributions. Each of the m mask distributions can have k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid.

At action 1106, the m mask distributions can be applied to m protein sequences in the p protein sequences to generate a partially-masked MSA 230 that contains masked residues and unmasked residues, where p>m. In various implementations, p>=m.

At action 1108, the masked residues and the unmasked residues can be translated into learned embeddings 400, and the learned embeddings 400 can be concatenated with residue position embeddings to generate an embedded representation (embedding token) 690 of the partially-masked MSA 230.

At action 1110, the embedded representation 690 can be chunked (or split) into a series of chunks 640, the chunks in the series of chunks 640 can be concatenated into a stack 660, and the stack 660 can be translated into a compressed representation 680 of the embedded representation 690. The compressed representation 680 can have m rows and r columns.

At action 1112, axial-attention 710 can be iteratively (or sequentially)applied across the m rows and the r columns of the compressedrepresentation, and the applied attention can be interleaved (withtransition layers) to generate an updated representation 720 of (orfrom) the compressed representation 680. The updated representation 720can have m rows and r columns.

At action 1114, k updated representation tiles 810 can be aggregatedfrom the updated representation 720. Each of the k updatedrepresentation tiles 810 can contain those updated representationfeatures of the updated representation 720 that correspond to the maskedresidues. Each of the k updated representation tiles can have m rows andk columns. A given column in the k columns of a given updatedrepresentation tile 806 can contain a respective subset of the updatedrepresentation features. The respective subset can be located at a givenordinal position in the k ordinal positions. The given ordinal positioncan be represented by the given column.

At action 1116, k embedding tiles 870 corresponding to the k updatedrepresentation tiles 810 can be aggregated from the embeddedrepresentation 690. Each of the k embedding tiles 844 can contain thoseembedding features in a first chunk of the series of chunks that aretranslations of the masked residues. Each of the k embedding tiles canhave m rows and k columns. A given column in the k columns of a givenembedding tile can contain a respective subset of the embeddingfeatures. The respective subset can be located at a given ordinalposition in the k ordinal positions. The given ordinal position can berepresented by the given column.

At action 1118, k Boolean tiles 834, 838 can be applied to the kembedding tiles to generate k Booleaned (partially revealed) embeddingtiles. Each of the k Boolean tiles can have m rows and k column. Each ofthe k Boolean tiles can cause concealment of a corresponding one of thek columns in a corresponding one of the k embedding tiles, and can causerevelation of other ones of the k columns in the corresponding one ofthe k embedding tiles. Each of the k Booleaned embedding tiles can havem rows and k columns.

At action 1120, the k Booleaned (partially revealed) embedding tiles 870can be concatenated with the k updated representation tiles 810 togenerate k concatenated tiles 873, and the k concatenated tiles 873 canbe translated into k compressed tile representations (informed MSArepresentation 740) of the k concatenated tiles 873. Each of the kcompressed tile representations can have m rows and k columns.

At action 1122, self-attention 750, 754, 756 can be iteratively appliedto the k compressed tile representations 740 to generate interpretationsof those compressed tile features in the k compressed tilerepresentations that correspond to those embedding features in the kembedding tiles that are revealed by the k Boolean tiles.

At action 1124, those interpreted features can be aggregated from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations (gathered MSA representation 772). The aggregated representation can have m rows and k columns.

At action 1126, the aggregated representation 772 can be translated into identities 790 of the masked residues.
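Actions 1124 and 1126 can be summarized as a gather at the still-concealed positions followed by an output head; a sketch, assuming a transition layer plus perceptron head over an assumed 22-token vocabulary:

    import torch
    import torch.nn as nn

    class RevelationHead(nn.Module):
        def __init__(self, d, vocab=22):
            super().__init__()
            self.transition = nn.Sequential(
                nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            self.perceptron = nn.Linear(d, vocab)

        def forward(self, gathered):               # gathered MSA representation (m, k, d)
            logits = self.perceptron(self.transition(gathered))
            return logits.argmax(-1)               # identities 790 of the masked residues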

System

FIG. 12 is a system 1200 that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.

A memory 1202 can store a multiple sequence alignment (MSA) with a plurality of masked residues.

A chunking logic 1204 can be configured to chunk the MSA into a series of chunks.

A first attention logic 1206 can be configured to attend to a representation of the series of chunks and produce a first attention output.

A first aggregation logic 1208 can be configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues. The features include elements of an MSA, in one implementation, such as one-hot encodings of amino acids in the MSA.

A mask revelation logic 1210 can be configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues.

A second attention logic 1212 can be configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask.

A second aggregation logic 1214 can be configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask.

An output logic 1216 can be configured to produce identifications of the masked residues based on the second aggregated output.

Objective Indicia of Inventiveness and Non-Obviousness

FIG. 13 shows the performance evaluation 1300 of the language modelling part of the PrimateAI language model (LM) compared to the replicated VAE part of the EVE model (J. Frazer et al., Disease variant prediction with deep generative models of evolutionary data, Nature 599, 91-95 (2021); Evolutionary model of Variant Effect; labelled “EVE*”) and their combined score (labelled “PrimateAI LM+EVE*-only”). The performance is further compared to a selection of competitive unsupervised methods (ESM1v, SIFT, LIST-S2). In clockwise direction starting from the top left, the individual panels correspond to evaluation on DDD vs UKBB, Assays, ClinVar, ASD, CHD, DDD, and UKBB. For Assays and UKBB, the summary statistics are given in terms of the absolute value (|corr|) of the correlation between score and an experimental measure of pathogenicity, i.e., mean phenotype (UKBB) or assay score (Assays). For DDD, we calculate the P-value of the Wilcoxon rank-sum test for the control and case distributions over all datasets. For ClinVar, we measure the AUC averaged over all genes.

Evaluation Datasets

Saturation Mutagenesis Assays

Performance of the PrimateAI language model is compared using deep mutational scanning assays for the following 9 genes: Amyloid-beta, YAP1, MSH2, SYUA, VKOR1, PTEN, BRCA1, TP53, and ADRB2. A few assays of genes for which the prediction scores of some classifiers are unavailable are excluded from the evaluation analysis, including TPMT, RASH, CALM1, UBE2I, SUMO1, TPK1, and MAPK1. Also excluded are assays of KRAS (due to a different transcript sequence), SLCO1B1 (only 137 variants), and Amyloid-beta. Performance of the PrimateAI language model is evaluated by computing the absolute Spearman rank correlation between model prediction scores and assay scores individually for each assay and then taking the mean across all assays.
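The assay benchmark reduces to a mean of per-assay absolute Spearman correlations; a short sketch of that computation (the dict layout and function name are assumptions of this sketch):

    import numpy as np
    from scipy.stats import spearmanr

    def assay_benchmark(assays):
        # assays: dict mapping assay name -> (model_scores, assay_scores) arrays.
        corrs = [abs(spearmanr(pred, measured).correlation)
                 for pred, measured in assays.values()]
        return float(np.mean(corrs))               # mean absolute Spearman across assays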

UK Biobank

The UK Biobank (UKBB) dataset contains 61 phenotypes across 100 genes. Evaluating on the variants common to all methods reduces the number to 41 phenotypes across 42 genes. The absolute Spearman rank correlation is calculated between the predicted pathogenicity scores and the quantitative phenotype scores for each gene/phenotype pair. Only gene/phenotype pairs with at least 10 variants were included in the evaluation (14 phenotypes across 16 genes); it was confirmed that the evaluation is robust to this choice of threshold.

ClinVar

Performance of the PrimateAI language model in classifying clinical labels of ClinVar missense variants as benign or pathogenic is benchmarked. Both “benign” and “likely benign” labelled variants are considered benign, and likewise “pathogenic” and “likely pathogenic” labelled variants are both considered pathogenic. To ensure high-quality labels, only ClinVar variants with 1-star review status or above (including “criteria provided, single submitter”, “criteria provided, multiple submitters, no conflicts”, “reviewed by expert panel”, and “practice guideline”) are included. This reduced the number of variants from 36,705 to 22,165 for the pathogenic class and from 41,986 to 39,560 for the benign class. The area under the receiver operating characteristic curve (AUC) for each gene is calculated, and then the mean AUC across all genes is reported.

DDD/ASD/CHD De Novo Missense Variants

To evaluate the performance of the deep learning network in clinical settings, de novo mutations are obtained from published studies of intellectual disorders, including autism spectrum disorder (ASD) and developmental disorders (DDD). The ASD cohort contained 2,127 patients with at least one de novo missense (DNM) mutation, for a total of 3,135 DNM mutations. This reduced to 517 patients with at least one DNM variant and a total of 558 DNM variants after requiring that all methods had predictions for those variants. In DDD, 17,952 patients had at least one de novo missense variant (26,880 variants in total), reducing to 5,872 patients (6,398 variants) after requiring availability of predictions from all methods. A set of DNM variants from patients with congenital heart disorders (CHD) is also obtained, consisting of 1,839 de novo missense variants from 1,342 patients (reducing to 314 variants from 299 patients after requiring availability of predictions from all methods). For all three datasets of de novo variants from affected patients, a shared set of DNM variants from healthy controls is used, which contains 1,823 DNM variants, collected from multiple studies, from 1,215 healthy controls with at least one DNM variant. It was reduced to 250 variants (235 patients) after requiring availability of variant prediction scores from all methods. For each disease set of DNMs, the Mann-Whitney U test is applied to evaluate how well each classifier can distinguish the DNM set of patients from that of controls.
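The case/control comparison is a standard rank-sum test; a minimal sketch using SciPy, with the one-sided alternative and the function name being assumptions of this sketch:

    from scipy.stats import mannwhitneyu

    def case_control_separation(case_scores, control_scores):
        # Smaller P-value = better separation of patient DNMs from control DNMs.
        return mannwhitneyu(case_scores, control_scores, alternative="greater").pvalue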

Methods for Comparison

Predictions from other methods were evaluated using rank scores downloaded from the database for functional prediction, dbNSFP4.2a. To avoid dramatic reductions in the number of common variants, methods with incomplete sets of scores (methods covering fewer than 67 million of the 71 million possible missense variants in hg38) are removed, except Polyphen2, due to its widespread adoption. We included the following methods (method abbreviation) for comparison: BayesDel_noAF (BayesDel), CADD_raw (CADD), DANN, DEOGEN2, LIST-S2, M-CAP, MutationTaster_converted (MutationTaster), PROVEAN_converted (PROVEAN), Polyphen2_HVAR (Polyphen2; due to better performance than Polyphen2_HDIV), PrimateAI, REVEL, SIFT_converted (SIFT), VEST4, and fathmm-MKL_coding (fathmm-MKL; highest performance among the fathmm models for the given benchmarks).

Applying EVE to More Proteins

In the original publication, EVE is only applied to a small set of disease-associated genes in ClinVar. To generate the disclosed language model-based training data set, it is essential to expand the predictions of EVE to as many proteins as possible. Due to the unavailability of EVE source code, a similar method, DeepSequence, is applied, and DeepSequence scores are converted into EVE scores by fitting Gaussian mixture models. An up-to-date version of UniRef100 is used, but the alignment depth and sequence coverage filtering steps described in EVE are otherwise followed. At least 1 prediction in each of 18,920 proteins and a total of 50.2M predicted variants out of 71.2M possible missense variants are achieved. To validate the disclosed replication, the replicated EVE models are evaluated using published variants from EVE. Scores from the replicated EVE model result in comparable performance to the published EVE software on all benchmarking datasets, e.g., both methods achieve 0.41 mean absolute correlation on Assays and 0.22 mean absolute correlation on UKBB.

Benchmarking PrimateAI Language Model Against Other Sequence-Only Modelsfor Pathogenicity Predictions

The PrimateAI language model falls into a class of methods trained only to model protein sequences but performing surprisingly well as pathogenicity predictors. Despite not achieving the overall best performance by themselves, such methods provide crucial features or components for classifiers incorporating more diverse data. FIG. 13 summarizes the evaluation performance of the PrimateAI language model against other such sequence-only methods for pathogenicity prediction: ESM1v, EVE, LIST-S2, and SIFT. Our language model outperforms another language model, ESM1v, on all the testing datasets except assays, using only 1/50^(th) of the training time. This is particularly striking as the PrimateAI LM does not rely on any fine-tuning on assays.

Combining PrimateAI Language Model with EVE

Language models are trained to model the entire universe of proteins. EVE trains a separate model for each human protein and all similar sequences. This, and the differences in model architecture and training algorithms, suggest that the models extract distinct features from their input. Therefore, we expected the scores from EVE and our language model to be complementary, and that combining scores may result in improved performance. We found that simply taking the mean of their pathogenicity scores already performs better than either of the two methods alone. More elaborate combinations, e.g., using ridge regression, did not lead to any further improvements. The resulting performance is shown in FIG. 13, where the combined score leads to a performance gain of 6.6% (or 6.8%) in mean correlation on assays compared to the PrimateAI LM (or compared to replicated EVE), a 1.4% (or 1.7%) improvement in mean AUC on ClinVar, and increases in P-value by 11% (29%) for DDD, 3% (26%) for ASD, and 17% (23%) for CHD.

Top-1 Training Accuracy

FIG. 14 depicts the Top-1 training accuracy 1400 of the PrimateAI language model. An ensemble of six PrimateAI language model networks was trained with different random seeds for training data sampling and model parameter initialization. Their top-1 accuracies during training are shown in FIG. 14 for mask locations in the query sequence and all sequences in UniRef50 MSAs. Top-1 accuracy for the query sequence is much lower than for all sequences as the query sequence does not contain gap tokens, which are easier to predict than residues because gap tokens often form long and contiguous segments in MSAs. The PrimateAI language model accuracy on query sequences continues to improve with training. In some implementations, convergence can be accelerated by adding auxiliary losses to each layer of the PrimateAI language model.

Entropy and Pathogenicity Score

Scores of the PrimateAI language model can be tabulated for future reference, rather than re-running the model every time its scores are needed. For example, the PrimateAI language model's fill-in-the-blank predictions can be provided for locations of interest at every site in 19,071 human proteins, totaling predictions for 2,057,437,040 variants at 108,286,160 positions. A person skilled in the art will appreciate that these numbers would change, for example, if the small number of human proteins that were not included here were included. In some implementations, the PrimateAI language model can be ensembled to produce averaged scores that have higher performance than individual model scores. For example, each prediction can be made by an ensemble of six models, with each model contributing at least four inferences with different random seeds for sampling and ordering of sequences in human MSAs. Inference logits can be averaged by taking means of predictions grouped by random seed, and then taking the mean of the means.
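The described mean-of-means ensembling could be sketched as follows, assuming the per-inference logits are arranged as a (models, seeds, vocabulary) array; the layout and function name are assumptions of this sketch.

    import numpy as np

    def ensemble_logits(logits):
        # logits: (models, seeds, vocab) for one site, e.g., six models x four seeds.
        per_seed_mean = logits.mean(axis=0)        # mean of predictions grouped by random seed
        return per_seed_mean.mean(axis=0)          # mean of the means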

Pathogenicity prediction of a variant can be evaluated using the relative values of logits for reference and alternative amino acids, or evaluated by subtracting the logit value for the reference amino acid from the logit value for the alternative amino acid. The probabilities are normalized over all possible residues, disregarding the gap token, such that Σ_(r) p_(r) = 1, with probability p_(r) of the r^(th) residue obtained from the ensembled logits. The log difference captures how unlikely the variant amino acid is compared to the reference amino acid. However, the score does not consider the prediction of the other 18 possible amino acids, which contain information about the language model's internal estimate of protein site conservation as well as the convergence of the language model. The entropy evaluated over amino acid predictions, S = −Σ_(r) p_(r) log(p_(r)), with probability p_(r) of the r^(th) residue, was used to capture a variant-agnostic, site-dependent contribution to the pathogenicity score. Specifically, a score, s_(alt), for the alternative residue at a given site is given by the usual log difference of the alternative and reference logits at that site minus the entropy over amino acids at the given site, i.e., s_(alt) = log(p_(alt)) − log(p_(ref)) − S.
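A worked sketch of the entropy-adjusted score, assuming ensembled per-site logits over the 20 amino acids with the gap token already dropped; the function name and softmax normalization are assumptions of this sketch.

    import numpy as np

    def entropy_adjusted_score(logits, ref, alt):
        # Normalize so that sum_r p_r = 1 over amino acids (gap excluded).
        p = np.exp(logits - logits.max())
        p /= p.sum()
        entropy = -np.sum(p * np.log(p))           # S = -sum_r p_r log(p_r)
        return np.log(p[alt]) - np.log(p[ref]) - entropy   # s_alt = log p_alt - log p_ref - S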

The entropy term is small whenever the probability over all amino acids is dominated by a single term, and large whenever the model is uncertain about the residues and assigns multiple residues high values. Physically, in this case the site is associated with little conservation and is likely to mutate. This should lead to less pathogenic signal. Adjusting the scores by entropy incorporates a model-internal estimate of amino acid conservation. A given log difference between residue and reference will be considered more pathogenic whenever it is associated with a highly conserved site. The score adjustment additionally incorporates the lack of convergence associated with a heavily undertrained model.

“Logic” (e.g., masking logic), as used herein, can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The “logic” can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.

Computer System

FIG. 15 shows a computer system 1500 that can be used for compilation and runtime execution of the PrimateAI language model. Computer system 1500 includes at least one central processing unit (CPU) 1572 that communicates with a number of peripheral devices via bus subsystem 1555. These peripheral devices can include a storage subsystem 1510 including, for example, memory devices and a file storage subsystem 1536, user interface input devices 1538, user interface output devices 1576, and a network interface subsystem 1574. The input and output devices allow user interaction with computer system 1500. Network interface subsystem 1574 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the pathogenicity predictor 150 (e.g., the PrimateAI language model) is communicably linked to the storage subsystem 1510 and the user interface input devices 1538.

User interface input devices 1538 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1500.

User interface output devices 1576 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1500 to the user or to another machine or computer system.

Storage subsystem 1510 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 1578.

Processors 1578 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 1578 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 1578 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™ and GX15 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.

Memory subsystem 1522 used in the storage subsystem 1510 can include a number of memories, including a main random access memory (RAM) 1532 for storage of instructions and data during program execution and a read only memory (ROM) 1534 in which fixed instructions are stored. A file storage subsystem 1536 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1536 in the storage subsystem 1510, or in other machines accessible by the processor.

Bus subsystem 1555 provides a mechanism for letting the various components and subsystems of computer system 1500 communicate with each other as intended. Although bus subsystem 1555 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.

Computer system 1500 itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1500 depicted in FIG. 15 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 1500 are possible, having more or fewer components than the computer system depicted in FIG. 15.

Clauses

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.

One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer-readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer-readable storage medium (or multiple such media).

The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These and other features, aspects, and advantages of the technology disclosed will become apparent from the following detailed description of illustrative implementations thereof, which is to be read in connection with the accompanying drawings. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.

Other implementations of the clauses described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.

We disclose the following clauses:

Clause Set 1

1. A computer-implemented method of variant pathogenicity prediction, including:

- accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences;
- applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence;
- cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and
- generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
2. The computer-implemented method of clause 1, wherein the multiple sequence alignment aligns the query residue sequence to the plurality of non-query residue sequences along a per-position dimension and along a per-sequence dimension.
3. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is applied along the per-sequence dimension within a window of sequences in the multiple sequence alignment.
4. The computer-implemented method of clause 3, wherein the set of periodically-spaced masks is applied along the per-position dimension within a window of positions across the window of sequences in the multiple sequence alignment.
5. The computer-implemented method of clause 4, wherein the portion spans the window of positions across the multiple sequence alignment.
6. The computer-implemented method of clause 4, wherein the portion spans the window of positions across a subset of sequences in the multiple sequence alignment.
7. The computer-implemented method of clause 1, wherein the portion has a predetermined width and a predetermined height.
8. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have widths smaller than the predetermined width of the portion.
9. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have heights smaller than the predetermined height of the portion.
10. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is distributed along the per-sequence dimension into subsets of periodically-spaced masks.
11. The computer-implemented method of clause 10, wherein the subsets of periodically-spaced masks correspond to sequences in the window of sequences.
12. The computer-implemented method of clause 11, wherein successive masks in a subset of periodically-spaced masks corresponding to a given sequence in the window of sequences are spaced apart by unmasked residues in the given sequence.
13. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart is the same across the sequences in the window of sequences.
14. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart varies across the sequences in the window of sequences.
15. The computer-implemented method of clause 12, wherein a starting position in a given sequence at which a corresponding subset of periodically-spaced masks begins varies between the sequences in the window of sequences.
16. The computer-implemented method of clause 12, wherein the starting position follows a diagonal pattern across the sequences in the window of sequences.
17. The computer-implemented method of clause 14, wherein the starting position follows a diagonal pattern that begins to repeat at least once across the sequences in the window of sequences.
18. The computer-implemented method of clause 17, wherein the starting position follows a diagonal pattern that repeats at least once across the sequences in the window of sequences.
19. The computer-implemented method of clause 1, wherein the set of periodically-spaced masks has a pattern.
20. The computer-implemented method of clause 19, wherein the pattern is a diagonal pattern.
21. The computer-implemented method of clause 19, wherein the pattern is a hexagonal pattern.
22. The computer-implemented method of clause 19, wherein the pattern is a diamond pattern.
23. The computer-implemented method of clause 19, wherein the pattern is a rectangle pattern.
24. The computer-implemented method of clause 19, wherein the pattern is a square pattern.
25. The computer-implemented method of clause 19, wherein the pattern is a triangle pattern.
26. The computer-implemented method of clause 19, wherein the pattern is a convex pattern.
27. The computer-implemented method of clause 19, wherein the pattern is a concave pattern.
28. The computer-implemented method of clause 19, wherein the pattern is a polygonal pattern.
29. The computer-implemented method of clause 19, further including right-shifting a cropping window used for the cropping to minimize padding of the portion.
30. The computer-implemented method of clause 29, further including left-shifting the cropping window to minimize the padding of the portion.
31. The computer-implemented method of clause 1, further including configuring the cropping window to position the position-of-interest in a center column of the portion.
32. The computer-implemented method of clause 31, further including configuring the cropping window to position the position-of-interest adjacent to the center column.
33. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions with learned mask embeddings, and substituting, in the portion, the second set of residues at the second set of positions with learned residue embeddings.
34. The computer-implemented method of clause 33, wherein a one-hot encoding generator generates the learned mask embeddings and the learned residue embeddings.
35. The computer-implemented method of clause 34, wherein the learned mask embeddings and the learned residue embeddings are selected from a look-up table.
36. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions and the second set of residues at the second set of positions with learned position embeddings.
37. The computer-implemented method of clause 36, further including chunking the portion with the learned mask embeddings, the learned residue embeddings, and the learned position embeddings into a plurality of chunks.
38. The computer-implemented method of clause 37, further including processing the plurality of chunks as an aggregate and generating an alternative representation of the portion.
39. The computer-implemented method of clause 38, wherein a linear projection layer uses a filter bank of 1×1 convolutions to process the plurality of chunks as the aggregate and generate the alternative representation of the portion.
40. The computer-implemented method of clause 39, further including processing the alternative representation of the portion through a cascade of attention blocks to generate an updated alternative representation of the portion.
41. The computer-implemented method of clause 40, wherein attention blocks in the cascade of attention blocks use self-attention.
42. The computer-implemented method of clause 41, wherein each of the attention blocks includes a tied row-wise gated self-attention, followed by a column-wise gated self-attention, followed by a transition logic.
43. The computer-implemented method of clause 40, wherein the attention blocks use cross-attention.
44. The computer-implemented method of clause 40, wherein a mask revelation block processes the updated alternative representation of the portion and generates an informed alternative representation of the portion.
45. The computer-implemented method of clause 44, wherein the mask revelation block gathers features aligned with masked locations in a row, and, for each mask in the row, reveals embedded target tokens at other masked locations in the row.
46. The computer-implemented method of clause 44, wherein a mask gather block processes the informed alternative representation of the portion and generates a gathered alternative representation of the portion.
47. The computer-implemented method of clause 46, wherein the mask gather block processes the informed alternative representation through a cascade of transition logic and row-wise gated self-attention blocks that gather features where target embeddings remained masked.
48. The computer-implemented method of clause 47, wherein an output block processes the gathered alternative representation of the portion and predicts identities of residues masked by the set of periodically-spaced masks.
49. The computer-implemented method of clause 48, wherein the output block includes a transition logic and a perceptron logic.
50. The computer-implemented method of clause 48, wherein a probability of applying a subset of periodically-spaced masks to a non-sequence in the window of sequences is proportional to (1−a number of gap tokens in the non-sequence)^(−2).
51. The computer-implemented method of clause 1, further including generating the pathogenicity prediction for the variant based on a difference between a log probability of the variant and a log probability of a corresponding reference amino acid, less an entropy evaluated over amino acid-wise predictions.

Clause Set 2

1. A computer-implemented method, including:
accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid;
applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p>m;
translating the masked residues and the unmasked residues into learned embeddings, and concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation, wherein the compressed representation has m rows and r columns;
iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation, wherein the updated representation has m rows and r columns;
aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles has m rows and k columns, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles, and wherein each of the k Booleaned embedding tiles has m rows and k columns;
concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles, wherein each of the k compressed tile representations has m rows and k columns;
iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations, wherein the aggregated representation has m rows and k columns; and
translating the aggregated representation into identities of the masked residues.
2. The computer-implemented method of clause 1, further including using a one-hot encoding scheme to translate twenty naturally-occurring residues, a gap residue, and a mask into respective one-hot encoded vectors.
3. The computer-implemented method of clause 2, further including training a neural network to generate respective learned embeddings for the respective one-hot encoded vectors.
4. The computer-implemented method of clause 3, wherein the masked residues and the unmasked residues are translated into the learned embeddings based on a lookup table that maps the respective one-hot encoded vectors to the respective learned embeddings.
5. The computer-implemented method of clause 4, wherein the residue position embeddings specify an order in which residues are arranged in the p protein sequences.
6. The computer-implemented method of clause 1, wherein the chunks are concatenated into the stack along a channel dimension.
7. The computer-implemented method of clause 1, wherein the stack is translated into the compressed representation by processing the stack through a linear projection.
8. The computer-implemented method of clause 7, wherein the linear projection uses a plurality of one-dimensional (1D) convolution filters.
9. The computer-implemented method of clause 8, wherein the k concatenated tiles are translated into the k compressed tile representations by processing the k concatenated tiles through the linear projection.
10. The computer-implemented method of clause 1, wherein the aggregated representation is translated into the identities of the masked residues by processing the aggregated representation through a revelation output head.
11. The computer-implemented method of clause 1, wherein p=m.
12. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of the corresponding one of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.
13. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of a corresponding subset of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.
14. The computer-implemented method of clause 1, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.
15. A system, comprising:
memory storing a multiple sequence alignment (MSA) with a plurality of masked residues;
chunking logic configured to chunk the MSA into a series of chunks;
first attention logic configured to attend to a representation of the series of chunks and produce a first attention output;
first aggregation logic configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues;
mask revelation logic configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues;
second attention logic configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask;
second aggregation logic configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask; and
output logic configured to produce identifications of the masked residues based on the second aggregated output.
16. The system of clause 15, wherein the first attention logic uses axial-attention.
17. The system of clause 15, wherein the second attention logic uses self-attention.
18. A computer-implemented method, including:
accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions;
applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p>m;
translating the masked residues and the unmasked residues into learned embeddings, and concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation;
iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation;
aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues;
aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues;
applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles;
concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles;
iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations; and
translating the aggregated representation into identities of the masked residues.
19. The computer-implemented method of clause 18, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at varying offsets from a first residue position in the mask grid.
20. The computer-implemented method of clause 19, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.
21. The computer-implemented method of clause 18, wherein the compressed representation has m rows and r columns.
22. The computer-implemented method of clause 18, wherein the updated representation has m rows and r columns.
23. The computer-implemented method of clause 18, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.
24. The computer-implemented method of clause 18, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.
25. The computer-implemented method of clause 18, wherein each of the k Boolean tiles has m rows and k columns.
26. The computer-implemented method of clause 18, wherein each of the k Booleaned embedding tiles has m rows and k columns.
27. The computer-implemented method of clause 18, wherein each of the k compressed tile representations has m rows and k columns.
28. The computer-implemented method of clause 18, wherein the aggregated representation has m rows and k columns.

What is claimed is:
1. A computer-implemented method of variant pathogenicity prediction, including: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
2. The computer-implemented method of claim 1, wherein the multiple sequence alignment aligns the query residue sequence to the plurality of non-query residue sequences along a per-position dimension and along a per-sequence dimension.
3. The computer-implemented method of claim 2, wherein the set of periodically-spaced masks is applied along the per-sequence dimension within a window of sequences in the multiple sequence alignment.
4. The computer-implemented method of claim 3, wherein the set of periodically-spaced masks is applied along the per-position dimension within a window of positions across the window of sequences in the multiple sequence alignment.
5. The computer-implemented method of claim 1, wherein the portion has a predetermined width and a predetermined height.
6. The computer-implemented method of claim 5, wherein the portion is padded to compensate for multiple sequence alignments that have widths smaller than the predetermined width of the portion.
7. The computer-implemented method of claim 2, wherein the set of periodically-spaced masks is distributed along the per-sequence dimension into subsets of periodically-spaced masks.
8. The computer-implemented method of claim 7, wherein the subsets of periodically-spaced masks correspond to sequences in a window of sequences.
9. The computer-implemented method of claim 1, wherein the set of periodically-spaced masks has a pattern.
10. The computer-implemented method of claim 9, further including right-shifting a cropping window used for the cropping to minimize padding of the portion.
11. The computer-implemented method of claim 10, further including left-shifting the cropping window to minimize the padding of the portion.
12. The computer-implemented method of claim 1, further including configuring a cropping window to position the position-of-interest in a center column of the portion.
13. The computer-implemented method of claim 12, further including configuring the cropping window to position the position-of-interest adjacent to the center column.
14. The computer-implemented method of claim 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions with learned mask embeddings, and substituting, in the portion, the second set of residues at the second set of positions with learned residue embeddings.
15. The computer-implemented method of claim 14, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions and the second set of residues at the second set of positions with learned position embeddings.
16. The computer-implemented method of claim 15, further including chunking the portion with the learned mask embeddings, the learned residue embeddings, and the learned position embeddings into a plurality of chunks.
17. The computer-implemented method of claim 16, further including processing the plurality of chunks as an aggregate and generating an alternative representation of the portion.
18. The computer-implemented method of claim 1, further including generating the pathogenicity prediction for the variant based on a difference between a log probability of the variant and a log probability of a corresponding reference amino acid, less an entropy evaluated over amino acid-wise predictions.
19. A system including one or more processors coupled to memory, the memory loaded with computer instructions to predict variant pathogenicity, the instructions, when executed on the one or more processors, implement actions comprising: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.
20. A non-transitory computer readable storage medium impressed with computer program instructions to predict variant pathogenicity, the instructions, when executed on a processor, implement actions comprising: accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences; applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence; cropping a portion of the multiple sequence alignment that includes (i) the set of periodically-spaced masks at the first set of positions, and (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.